Automated Test: error-upsampling-race-condition #318
Conversation
…(#94376) Part of the Error Upsampling project: https://www.notion.so/sentry/Tech-Spec-Error-Up-Sampling-1e58b10e4b5d80af855cf3b992f75894?source=copy_link

The events-stats API will now check whether all projects in the query are allowlisted for upsampling and, if so, convert the count query to a sum over `sample_weight` in Snuba. This is done by defining a new SnQL function, `upsampled_count()`. I noticed there are also eps() and epm() functions in use in this endpoint. I considered (and even worked on) supporting swaps for eps() and epm() as well, which for correctness should probably also use `sample_weight` rather than counting naively, but this caused some complications; since they are only used by specific dashboard widgets and are not available in Discover, I decided to defer changing them until we find it is needed.
- Add 60-second cache for upsampling eligibility checks to improve performance
- Separate upsampling eligibility check from query transformation for better optimization
- Remove unnecessary null checks in upsampled_count() function per schema requirements
- Add cache invalidation utilities for configuration management

This improves performance during high-traffic periods by avoiding repeated expensive allowlist lookups while maintaining data consistency.
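The core idea, replacing a naive count with a sum over `sample_weight`, can be sketched in plain Python. This is illustrative only: in production the aggregation runs inside ClickHouse via the `upsampled_count()` SnQL function (roughly `toInt64(sum(sample_weight))`), and the event structure below is a simplified assumption, including the assumption that each stored event carries `sample_weight = 1 / client_sample_rate`.

```python
# Illustrative sketch of weighted error counting. Field names are simplified
# and do not mirror the real Snuba schema.

def estimated_error_count(events):
    """Sum per-event weights to estimate how many errors actually occurred
    before client-side sampling dropped most of them."""
    return round(sum(e["sample_weight"] for e in events))

# Two events kept at a 10% client sample rate each stand in for ~10 real errors.
events = [{"sample_weight": 1 / 0.1}, {"sample_weight": 1 / 0.1}]
print(estimated_error_count(events))  # → 20
```

With a 10% client sample rate each stored event carries a weight of 10, so the weighted count recovers the pre-sampling total that a naive `count()` would report as 2.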
📝 WalkthroughWalkthroughThis pull request introduces error upsampling functionality for Snuba queries, enabling weighted counting of errors from allowlisted projects. It includes a new helper module for eligibility checks and query column transformations, integrates these utilities into the organization events stats endpoint, adds an upsampled_count function to the discover dataset, and provides comprehensive test coverage. Changes
Sequence Diagram

sequenceDiagram
    participant Client
    participant Endpoint as organization_events_stats Endpoint
    participant ErrUpsampling as Error Upsampling Helper
    participant Transform as Column Transformer
    participant SnubaQuery as Snuba Query Executor
    participant Dataset as Discover Dataset
    Client->>Endpoint: Request stats (organization, projects)
    Endpoint->>ErrUpsampling: Check eligibility via is_errors_query_for_error_upsampled_projects
    ErrUpsampling->>ErrUpsampling: Check cache (org_id, project_ids)
    alt Cache miss
        ErrUpsampling->>ErrUpsampling: Validate with dataset & query context
        ErrUpsampling->>ErrUpsampling: Fetch allowlist from options
        ErrUpsampling->>ErrUpsampling: Cache result for 60s
    end
    ErrUpsampling-->>Endpoint: eligibility decision (boolean)
    alt Upsampling enabled
        Endpoint->>Transform: transform_query_columns_for_error_upsampling
        Transform->>Transform: Convert count() to upsampled_count() as count
        Transform-->>Endpoint: transformed_columns
        Endpoint->>SnubaQuery: Execute with transformed columns
    else Upsampling disabled
        Endpoint->>SnubaQuery: Execute with original columns
    end
    SnubaQuery->>Dataset: Resolve upsampled_count function
    Dataset->>Dataset: Apply toInt64(sum(sample_weight)) aggregation
    Dataset-->>SnubaQuery: Weighted error counts
    SnubaQuery-->>Endpoint: Query results
    Endpoint-->>Client: Event stats response
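The column-transformation step in the diagram can be sketched as follows. This is a hypothetical simplification — the real `transform_query_columns_for_error_upsampling` helper lives in `src/sentry/api/helpers/error_upsampling.py` and may match columns differently:

```python
# Hypothetical sketch of the column rewrite from the sequence diagram:
# count() becomes upsampled_count(), aliased back to "count" so the
# response shape seen by clients is unchanged.

def transform_query_columns_for_error_upsampling(columns):
    transformed = []
    for column in columns:
        if column.strip().lower() == "count()":
            transformed.append("upsampled_count() as count")
        else:
            transformed.append(column)  # eps()/epm() etc. are left untouched
    return transformed

print(transform_query_columns_for_error_upsampling(["count()", "eps()"]))
# → ['upsampled_count() as count', 'eps()']
```

The alias back to `count` is what keeps downstream consumers (dashboards, the stats serializer) unaware that a different aggregation ran underneath.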
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: 1 passed ✅, 1 failed ❌ (1 warning)
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@src/sentry/api/endpoints/organization_events_stats.py`:
- Around line 218-222: The upsampling eligibility check currently calls
is_errors_query_for_error_upsampled_projects(snuba_params, organization,
dataset, request) using the outer dataset, which can be incorrect when
_get_event_stats is called with a different dataset; update the call to use
scoped_dataset instead (i.e.,
is_errors_query_for_error_upsampled_projects(snuba_params, organization,
scoped_dataset, request)) so the check matches the actual dataset being queried
and preserves correct behavior across split paths.
In `@src/sentry/api/helpers/error_upsampling.py`:
- Around line 27-38: The cache key uses Python's process-randomized hash
(cache_key built from hash(tuple(sorted(snuba_params.project_ids)))), which
breaks cross-worker cache/invalidation; replace the hashed portion with a
deterministic representation of the sorted project IDs (e.g., join the sorted
snuba_params.project_ids into a stable string) when building cache_key in this
module (refer to cache_key, snuba_params.project_ids, organization.id), and make
the same change where hashing is currently used (also update
invalidate_upsampling_cache usage to use the same deterministic key generation
and ensure _are_all_projects_error_upsampled remains the eligibility check).
In `@src/sentry/testutils/factories.py`:
- Around line 344-357: The _set_sample_rate_from_error_sampling function
currently skips zero values because it checks `if client_sample_rate:` and uses
broad excepts; change the presence check to `client_sample_rate is not None` so
`"0"`/0 are accepted, and replace the blanket `except Exception` blocks with
narrow exception handling when parsing (catch ValueError and TypeError) around
`float(client_sample_rate)`; keep the initial retrieval via
normalized_data.get("contexts", {}).get("error_sampling",
{}).get("client_sample_rate") and on successful parse set
normalized_data["sample_rate"] to the parsed float.
In `@tests/snuba/api/endpoints/test_organization_events_stats.py`:
- Around line 3557-3603: Tests leak allowlist eligibility across runs because
is_errors_query_for_error_upsampled_projects caches results; clear that cache in
setUp() before applying mock_options so each test starts fresh. Locate the setUp
method in the test class and add a cache invalidation call for
is_errors_query_for_error_upsampled_projects (for example use
is_errors_query_for_error_upsampled_projects.cache_clear() or the module's
provided invalidation helper) before any mock_options mutations so cached
org+project_id results cannot affect subsequent tests.
🧹 Nitpick comments (2)
src/sentry/api/helpers/error_upsampling.py (1)
35-35: Drop or use the unused `organization` parameter.
`organization` is unused in `_are_all_projects_error_upsampled` (Ruff ARG001). Consider removing it from the signature and call sites, or using it for per-org allowlist selection if intended.

✅ Suggested adjustment

-def _are_all_projects_error_upsampled(
-    project_ids: Sequence[int], organization: Organization
-) -> bool:
+def _are_all_projects_error_upsampled(project_ids: Sequence[int]) -> bool:
@@
-    is_eligible = _are_all_projects_error_upsampled(snuba_params.project_ids, organization)
+    is_eligible = _are_all_projects_error_upsampled(snuba_params.project_ids)

Also applies to: 43-64
src/sentry/search/events/datasets/discover.py (1)
1041-1052: Avoid unused lambda arg warning in `upsampled_count`.
Ruff flags `args` as unused; renaming keeps lint clean.

✅ Suggested adjustment

-            snql_aggregate=lambda args, alias: Function(
+            snql_aggregate=lambda _args, alias: Function(
    # Early upsampling eligibility check for performance optimization
    # This cached result ensures consistent behavior across query execution
    should_upsample = is_errors_query_for_error_upsampled_projects(
        snuba_params, organization, dataset, request
    )
Use scoped_dataset for upsampling eligibility.
_get_event_stats is invoked with different datasets during split paths, but the eligibility check uses the outer dataset. This can skip upsampling on the actual dataset being queried. Use scoped_dataset here.
✅ Suggested fix
- should_upsample = is_errors_query_for_error_upsampled_projects(
- snuba_params, organization, dataset, request
- )
+ should_upsample = is_errors_query_for_error_upsampled_projects(
+ snuba_params, organization, scoped_dataset, request
+    )
    cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}"

    # Check cache first for performance optimization
    cached_result = cache.get(cache_key)
    if cached_result is not None:
        return cached_result and _should_apply_sample_weight_transform(dataset, request)

    # Cache miss - perform fresh allowlist check
    is_eligible = _are_all_projects_error_upsampled(snuba_params.project_ids, organization)

    # Cache for 60 seconds to improve performance during traffic spikes
    cache.set(cache_key, is_eligible, 60)
Use deterministic cache keys for cross-worker consistency.
hash() is not guaranteed to be stable across processes or Python versions (string hashing in particular is randomized per process), so cache hits and invalidation may not line up across workers. This makes invalidate_upsampling_cache unreliable in distributed deployments. Use a stable key based on sorted IDs.
✅ Suggested adjustment
+def _make_upsampling_cache_key(organization_id: int, project_ids: Sequence[int]) -> str:
+ project_key = ",".join(map(str, sorted(project_ids)))
+ return f"error_upsampling_eligible:{organization_id}:{project_key}"
+
def is_errors_query_for_error_upsampled_projects(
snuba_params: SnubaParams,
organization: Organization,
dataset: ModuleType,
request: Request,
) -> bool:
@@
- cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}"
+ cache_key = _make_upsampling_cache_key(organization.id, snuba_params.project_ids)
@@
def invalidate_upsampling_cache(organization_id: int, project_ids: Sequence[int]) -> None:
@@
- cache_key = f"error_upsampling_eligible:{organization_id}:{hash(tuple(sorted(project_ids)))}"
+ cache_key = _make_upsampling_cache_key(organization_id, project_ids)
     cache.delete(cache_key)

Also applies to: 73-74
def _set_sample_rate_from_error_sampling(normalized_data: MutableMapping[str, Any]) -> None:
    """Set 'sample_rate' on normalized_data if contexts.error_sampling.client_sample_rate is present and valid."""
    client_sample_rate = None
    try:
        client_sample_rate = (
            normalized_data.get("contexts", {}).get("error_sampling", {}).get("client_sample_rate")
        )
    except Exception:
        pass
    if client_sample_rate:
        try:
            normalized_data["sample_rate"] = float(client_sample_rate)
        except Exception:
            pass
Handle zero sample rates and narrow exception handling.
if client_sample_rate: skips "0"/0 and the broad except Exception: pass triggers Ruff BLE001/S110. Consider checking for is not None and catching only parsing errors.
✅ Suggested adjustment
def _set_sample_rate_from_error_sampling(normalized_data: MutableMapping[str, Any]) -> None:
"""Set 'sample_rate' on normalized_data if contexts.error_sampling.client_sample_rate is present and valid."""
- client_sample_rate = None
- try:
- client_sample_rate = (
- normalized_data.get("contexts", {}).get("error_sampling", {}).get("client_sample_rate")
- )
- except Exception:
- pass
- if client_sample_rate:
- try:
- normalized_data["sample_rate"] = float(client_sample_rate)
- except Exception:
- pass
+ contexts = normalized_data.get("contexts") or {}
+ if isinstance(contexts, Mapping):
+ error_sampling = contexts.get("error_sampling") or {}
+ if isinstance(error_sampling, Mapping) and "client_sample_rate" in error_sampling:
+ client_sample_rate = error_sampling.get("client_sample_rate")
+ try:
+ normalized_data["sample_rate"] = float(client_sample_rate)
+ except (TypeError, ValueError):
+ pass🧰 Tools
🪛 Ruff (0.14.14)
[error] 351-352: try-except-pass detected, consider logging the exception
(S110)
[warning] 351-351: Do not catch blind exception: Exception
(BLE001)
[error] 356-357: try-except-pass detected, consider logging the exception
(S110)
[warning] 356-356: Do not catch blind exception: Exception
(BLE001)
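The zero-rate pitfall flagged in this comment comes from Python truthiness: numeric zero is falsy, so `if value:` silently discards a legitimate 0.0 sample rate. A minimal demonstration with hypothetical helper names:

```python
def parse_sample_rate_buggy(value):
    if value:  # 0 and 0.0 are falsy, so a zero rate is dropped
        return float(value)
    return None

def parse_sample_rate_fixed(value):
    if value is None:
        return None
    try:
        return float(value)
    except (TypeError, ValueError):  # narrow handling keeps real bugs visible
        return None

print(parse_sample_rate_buggy(0.0))    # → None (zero rate lost)
print(parse_sample_rate_fixed(0.0))    # → 0.0
print(parse_sample_rate_fixed("abc"))  # → None
```

The narrowed `except (TypeError, ValueError)` also satisfies the Ruff BLE001/S110 findings quoted above, since only parsing failures are swallowed.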
    def setUp(self):
        super().setUp()
        self.login_as(user=self.user)
        self.authed_user = self.user

        self.day_ago = before_now(days=1).replace(hour=10, minute=0, second=0, microsecond=0)

        self.project = self.create_project()
        self.project2 = self.create_project()
        self.user = self.create_user()
        self.user2 = self.create_user()

        # Store some error events with error_sampling context
        self.store_event(
            data={
                "event_id": "a" * 32,
                "message": "very bad",
                "type": "error",
                "exception": [{"type": "ValueError", "value": "Something went wrong 1"}],
                "timestamp": (self.day_ago + timedelta(minutes=1)).isoformat(),
                "fingerprint": ["group1"],
                "tags": {"sentry:user": self.user.email},
                "contexts": {"error_sampling": {"client_sample_rate": 0.1}},
            },
            project_id=self.project.id,
        )
        self.store_event(
            data={
                "event_id": "b" * 32,
                "message": "oh my",
                "type": "error",
                "exception": [{"type": "ValueError", "value": "Something went wrong 2"}],
                "timestamp": (self.day_ago + timedelta(hours=1, minutes=1)).isoformat(),
                "fingerprint": ["group2"],
                "tags": {"sentry:user": self.user2.email},
                "contexts": {"error_sampling": {"client_sample_rate": 0.1}},
            },
            project_id=self.project2.id,
        )
        self.wait_for_event_count(self.project.id, 1)
        self.wait_for_event_count(self.project2.id, 1)

        self.url = reverse(
            "sentry-api-0-organization-events-stats",
            kwargs={"organization_id_or_slug": self.project.organization.slug},
        )
Clear the error‑upsampling cache between tests to avoid order‑dependent behavior.
is_errors_query_for_error_upsampled_projects caches allowlist eligibility by org+project_ids. Since each test mutates the allowlist via mock_options, a cached result can leak across tests and flip expectations. Clear or invalidate the cache in setUp() to keep these tests deterministic.
Suggested fix
+from sentry.api.helpers.error_upsampling import invalidate_upsampling_cache
+
class OrganizationEventsStatsErrorUpsamplingTest(APITestCase, SnubaTestCase):
endpoint = "sentry-api-0-organization-events-stats"
def setUp(self):
super().setUp()
self.login_as(user=self.user)
self.authed_user = self.user
@@
self.project = self.create_project()
self.project2 = self.create_project()
@@
self.url = reverse(
"sentry-api-0-organization-events-stats",
kwargs={"organization_id_or_slug": self.project.organization.slug},
)
+
+ # Avoid allowlist cache bleed across tests
+ invalidate_upsampling_cache(
+ self.project.organization.id, [self.project.id, self.project2.id]
+    )
This pull request was automatically created by @coderabbitai/e2e-reviewer (batch-created pull request).